Skip to content

Conversation

@kayunder
Copy link

Proposed draft for Agent Behavior Hijacking.

[Title of Your PR]

Key Changes:

  • List major changes and core updates
  • Keep each line under 80 characters
  • Focus on the "what" and "why"

Added:

  • New features/functionality
  • New files/configurations
  • New dependencies

Changed:

  • Updates to existing code
  • Configuration changes
  • Dependency updates

Removed:

  • Deleted files/code
  • Removed dependencies
  • Cleaned up configurations

Proposed draft for Agent Behavior Hijacking.

Signed-off-by: kayunder <[email protected]>
Copy link
Collaborator

@itskerenkatz itskerenkatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loved it! super concise and yet detailed and practical!
I have added some comments and thoughts

**Description:**

A brief description of the vulnerability that includes its potential effects such as system compromises, data breaches, or other security concerns.
AI Agents require autonomous ability to plan and execute tasks to achieve a goal. The independently initiated chain of events that occur throughout the activity path of an Agent can often be described as the “behavior” of the agent. Due to the overall weakness in the agent’s ability to determine its own instructions against the possible nefarious injection of instructions that would direct the agent to behave in an unintended way, the intended behavior of the agent is susceptible to manipulation. This inherent weakness is due to the use of natural language processing within the AI components of the agent.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not only due to the language processing concept, but also due to the methods in which the alignment and RLHF processes are being done, right?

A brief description of the vulnerability that includes its potential effects such as system compromises, data breaches, or other security concerns.
AI Agents require autonomous ability to plan and execute tasks to achieve a goal. The independently initiated chain of events that occur throughout the activity path of an Agent can often be described as the “behavior” of the agent. Due to the overall weakness in the agent’s ability to determine its own instructions against the possible nefarious injection of instructions that would direct the agent to behave in an unintended way, the intended behavior of the agent is susceptible to manipulation. This inherent weakness is due to the use of natural language processing within the AI components of the agent.

The OWASP LLM01:2025 Prompt Injection risk is also highly relevant to Agent Behavior Hijacking. Prompt injection occurs when malicious input alters an LLM’s behavior or output in unintended ways. For an autonomous AI agent, a well-crafted prompt injection can override system instructions, tricking the agent into taking harmful actions, disclosing secrets, or executing commands that were never intended. In this way, prompt injection directly facilitates the hijacking of an agent’s decision-making process, making it one of the most potent enablers of Agent Behavior Hijacking.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting question to me is wether or not we want to target or even mention that this could be either prompt injection that has a specific target or harm to achieve, or a jailbreak that aims to completely override the agent's guardrails (I think it is quite common to distinguish between the two and it may be helpful for the readers), what do you think? I do agree that at the end of the day these are two manipulations results in harmful risky consequences, but maybe worth mention it?


The OWASP LLM01:2025 Prompt Injection risk is also highly relevant to Agent Behavior Hijacking. Prompt injection occurs when malicious input alters an LLM’s behavior or output in unintended ways. For an autonomous AI agent, a well-crafted prompt injection can override system instructions, tricking the agent into taking harmful actions, disclosing secrets, or executing commands that were never intended. In this way, prompt injection directly facilitates the hijacking of an agent’s decision-making process, making it one of the most potent enablers of Agent Behavior Hijacking.

Additionally, When correlated to the OWASP Agentic AI Threats and Mitigations Guide, there are a few threats that can be directly linked to Agent Behavior Hijacking. Specifically, T01 – Memory Poisoning, T02 – Tool Misuse, T06 – Goal Manipulation, and T07 – Misaligned & Deceptive Behaviors all describe scenarios where an attacker subverts an agent’s autonomy and decision-making.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loved it!

2. Example 2: Another instance or type of this vulnerability.
3. Example 3: Yet another instance or type of this vulnerability.
1. Example 1: Indirect Prompt Injection via hidden instruction payloads embedded in web pages or documents silently redirect an agent to exfiltrate sensitive data or misuse connected tools.
2. Example 2: Indirect Prompt Injection via email hijacks an agent’s internal mail capability, sending unauthorized messages under a trusted identity.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add: "an email sent from outside of the organization" to emphasize how easy it is to preform the attack from outside of the company as an attacker

3. Example 3: Yet another instance or type of this vulnerability.
1. Example 1: Indirect Prompt Injection via hidden instruction payloads embedded in web pages or documents silently redirect an agent to exfiltrate sensitive data or misuse connected tools.
2. Example 2: Indirect Prompt Injection via email hijacks an agent’s internal mail capability, sending unauthorized messages under a trusted identity.
3. Example 3: System Prompt Override manipulates core instructions to reorient the agent’s objectives toward attacker-defined outcomes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we'll be even more specific in here?
I think if we can link to a specific attack pattern that will be more feasible for the readers it might be helpful.
Also - I think mentioning the outcome of hijacking a workflow is important in here!
Just an example can be "CEO injection attack", or a clients serving bot refunding a user with much greater refund etc. what do you think?

1. Prevention Step 1: A step or strategy that can be used to prevent the vulnerability or mitigate its effects.
2. Prevention Step 2: Another prevention step or strategy.
3. Prevention Step 3: Yet another prevention step or strategy.
1. Prevention Step 1: Establish continuous monitoring of agent activity throughout the chain of actions to build a known baseline of behavior. This baseline will allow for alerts to be triggered when the behavior of the agent strays from the established historical pattern.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am loving it.
I think not only from historical pattern but also from intended goal, right?

2. Prevention Step 2: Another prevention step or strategy.
3. Prevention Step 3: Yet another prevention step or strategy.
1. Prevention Step 1: Establish continuous monitoring of agent activity throughout the chain of actions to build a known baseline of behavior. This baseline will allow for alerts to be triggered when the behavior of the agent strays from the established historical pattern.
2. Prevention Step 2: Incorporate AI Agents into the established Insider Threat Program to monitor the behavior against established baselines and allow for investigation in case of outlier activity.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also:
Surface any insider prompts intended to get access to sensitive data or to alter the agent behavior?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another super interesting topic that is not discussed enough: how about when users are doing a reconnaissance to eventually perform an adversarial? Asking lots of questions about the agents goals and boundaries to eventually attack it - I think it worth looking for it too as a preventive mitigation.

3. Prevention Step 3: Yet another prevention step or strategy.
1. Prevention Step 1: Establish continuous monitoring of agent activity throughout the chain of actions to build a known baseline of behavior. This baseline will allow for alerts to be triggered when the behavior of the agent strays from the established historical pattern.
2. Prevention Step 2: Incorporate AI Agents into the established Insider Threat Program to monitor the behavior against established baselines and allow for investigation in case of outlier activity.
3. Prevention Step 3: Prevention Step 3: Ensure the weights of the goals in the Agent system prompt are balanced accurately to ensure the behavior of the agent will adhere to the intent of the builders. This will help in identifying possible issues.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will help to prevent some of the issues :)

Scenario #1: EchoLeak — Zero-Click Indirect Prompt Injection - An attacker emails a crafted message that silently triggers Microsoft 365 Copilot to execute hidden instructions, causing the AI to exfiltrate confidential emails, files, and chat logs without any user interaction.

Scenario #2: Another example of an attack scenario showing a different way the vulnerability could be exploited.
Scenario #2: Operator Prompt Injection via Web Content - An attacker plants malicious content on a web page that the Operator agent processes, tricking it into following unauthorized instructions. The Operator agent then accesses authenticated internal pages and exposes users’ private data, demonstrating how lightly guarded autonomous agents can leak sensitive information through prompt injection.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to add one around workflow hijacking
My folks have created the CEO injection attack that really explains it, but they just posted it on Linkedin or something.. I can ask them to get it documented somewhere so we can refer to it because I think it's really an important message - or of course if you have any other example or demonstration to it I think it'll be great!

Copy link
Contributor

@almogbhl almogbhl Oct 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itskerenkatz @kayunder can this be useful here?

Visual Studio Code & Agentic AI workflows RCE – Sep 2025 Command injection in agentic AI workflows can let a remote, unauthenticated attacker cause VS Code to run injected commands on the developer’s machine. ASI01+ASI02+ASI05
Google Gemini Trifecta — Cloud Assist, Search Model & Browsing (Sep 2025) Indirect prompt injection through logs, search history, and browsing context can trick Gemini into exposing sensitive data and carrying out unintended actions across connected Google services. ASI01+ASI02

Copy link

@cjj884 cjj884 Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend scenario title be "Operator Indirect Prompt Injection via Web Content" reflecting indirect nature of injection and consistent with example provided above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants